    Time-Varying Quasi-Closed-Phase Analysis for Accurate Formant Tracking in Speech Signals

    In this paper, we propose a new method for the accurate estimation and tracking of formants in speech signals using time-varying quasi-closed-phase (TVQCP) analysis. Conventional formant tracking methods typically adopt a two-stage estimate-and-track strategy wherein an initial set of formant candidates is estimated using short-time analysis (e.g., 10--50 ms), followed by a tracking stage based on dynamic programming or a linear state-space model. One of the main disadvantages of these approaches is that the tracking stage, however good it may be, cannot improve upon the formant estimation accuracy of the first stage. The proposed TVQCP method provides single-stage formant tracking that combines the estimation and tracking stages into one. TVQCP analysis combines three approaches to improve formant estimation and tracking: (1) it uses temporally weighted quasi-closed-phase analysis to derive closed-phase estimates of the vocal tract with reduced interference from the excitation source, (2) it increases residual sparsity by using L1 optimization, and (3) it uses time-varying linear prediction analysis over long time windows (e.g., 100--200 ms) to impose a continuity constraint on the vocal tract model and hence on the formant trajectories. Formant tracking experiments with a wide variety of synthetic and natural speech signals show that the proposed TVQCP method performs better than conventional and popular formant tracking tools such as Wavesurfer and Praat (based on dynamic programming), the KARMA algorithm (based on Kalman filtering), and DeepFormants (based on deep neural networks trained in a supervised manner). Matlab scripts for the proposed method can be found at: https://github.com/njaygowda/ftrac
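    The abstract contrasts TVQCP with the conventional first stage, in which formant candidates are read off the roots of a short-time linear prediction (LP) polynomial. Below is a minimal sketch of that baseline stage only (Python/NumPy assumed; the window, LP order, and plausibility thresholds are illustrative, and nothing here reproduces the TVQCP weighting or the L1 step):

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_formants(frame, fs, order=12):
    """Estimate formant candidates from one short-time frame via LP root-finding."""
    frame = frame * np.hamming(len(frame))
    # autocorrelation sequence, lags 0..len(frame)-1
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # autocorrelation method: solve the Toeplitz normal equations R a = r
    a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
    roots = np.roots(np.concatenate(([1.0], -a)))  # roots of A(z) = 1 - sum a_k z^-k
    roots = roots[np.imag(roots) > 0]              # one root per conjugate pair
    freqs = np.angle(roots) * fs / (2 * np.pi)     # pole angles -> frequencies (Hz)
    bws = -np.log(np.abs(roots)) * fs / np.pi      # pole radii -> 3 dB bandwidths
    keep = (freqs > 90) & (bws < 400)              # illustrative plausibility gate
    return np.sort(freqs[keep])
```

    A conventional tracker runs this on every 10--50 ms frame and then smooths the candidates in a second stage; TVQCP instead fits a single time-varying model across a 100--200 ms window, so estimation and tracking happen jointly.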

    High-resolution three-dimensional hybrid MRI + low-dose CT vocal tract modeling: A cadaveric pilot study

    Objectives: MRI-based vocal tract models have many applications in voice research and education. These models do not adequately capture bony structures (e.g., teeth, mandible), and spatial resolution is often relatively low in order to minimize scanning time. Most MRI sequences achieve 3D vocal tract coverage at gross resolutions of 2 mm³ within a scan time of <20 seconds. Computed tomography (CT) is well suited for vocal tract imaging but is infrequently used due to the risk of ionizing radiation. In this cadaveric study, a single, extremely low-dose CT scan of the bony structures is blended with accelerated high-resolution (1 mm³) MRI scans of the soft tissues, creating a high-resolution hybrid CT-MRI vocal tract model.
    Methods: Minimum CT dosages were determined, and a custom 16-channel airway receiver coil for accelerated high-resolution (1 mm³) MRI was evaluated. A rigid-body, landmark-based partial volume registration scheme was then applied to the images, creating a hybrid CT-MRI model that was segmented in Slicer.
    Results: Ultra-low-dose CT produced images with sufficient quality to clearly visualize the bone and exposed the cadaver to 0.06 mSv, comparable to atmospheric exposure during a round-trip transatlantic flight. The custom 16-channel vocal tract coil produced acceptable image quality at 1 mm³ resolution when reconstructed from ~6-fold undersampled data. High-resolution (1 mm³) MR imaging of short (<10 seconds) sustained sounds was achieved. The feasibility of hybrid CT-MRI vocal tract modeling was successfully demonstrated using the rigid-body, landmark-based partial volume registration scheme. Segmentations of CT and hybrid CT-MRI images provided more detailed 3D representations of the vocal tract than 2 mm³ MRI-based segmentations.
    Conclusions: The method described in this study indicates that high-resolution CT and MR image sets can be combined so that structures such as teeth and bone are accurately represented in vocal tract reconstructions. Such scans will aid learning and deepen understanding of anatomical features that relate to voice production, as well as furthering knowledge of the static and dynamic functioning of individual structures relating to voice production.
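    The registration step pairs corresponding anatomical landmarks in the CT and MRI volumes and solves for a rigid-body transform between them. A common closed-form solution is the Kabsch/SVD alignment sketched below; this is a generic illustration of landmark-based rigid registration, not the paper's implementation, and the landmark coordinates are hypothetical:

```python
import numpy as np

def rigid_landmark_registration(src, dst):
    """Kabsch/SVD: rotation R and translation t minimizing ||(src @ R.T + t) - dst||."""
    src_mean, dst_mean = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_mean).T @ (dst - dst_mean)   # 3x3 cross-covariance of landmarks
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_mean - R @ src_mean
    return R, t

# hypothetical paired landmarks (N x 3), e.g. points picked on teeth/mandible in both volumes
mri_pts = np.array([[10.0, 4.0, 2.0], [12.0, 8.0, 1.0], [9.0, 6.0, 5.0], [11.0, 3.0, 7.0]])
ct_pts = mri_pts + np.array([1.0, -2.0, 0.5])   # toy case: a pure translation
R, t = rigid_landmark_registration(mri_pts, ct_pts)
aligned = mri_pts @ R.T + t                     # maps MRI landmarks into CT space
```

    Once the transform is estimated from a handful of landmarks, it can be applied to the whole MRI volume so the soft-tissue data and the CT bone data share one coordinate frame before blending and segmentation.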

    An approach to explaining formants (Story, 2024)

    Purpose: This tutorial describes a possible approach to teaching the concept of formants to students in a speech science course, at either the undergraduate or graduate level. The approach is to explain formants as prominent regions of energy in the output spectrum envelope radiated at the lips, and to show how they arise as the superposition of vocal tract resonances on a source signal. Standing waves associated with vocal tract resonances are briefly explained, and standing-wave animations are provided. Animations of the temporal variation of the vocal tract, vocal tract resonances, spectra, and spectrograms, along with audio samples, are included to provide dynamic demonstrations of the concept of formants.
    Conclusions: The explanations, accompanying demonstrations, and suggested activities are intended to provide a launching point for understanding formants and how they can be measured, analyzed, and interpreted. As a result, participants should be able to describe the meaning of the term “formant” as it relates to a spectrum and a spectrogram, explain the difference between formants and vocal tract resonances, explain how vocal tract resonances combined with the voice source generate formants, and identify formants in both narrow-band and wide-band spectrograms and track their time-varying patterns with a formant tracking algorithm.
    Supplemental Material S1. Standing wave in neutral vocal tract configuration for the first resonance.
    Supplemental Material S2. Standing wave in neutral vocal tract configuration for the second resonance.
    Supplemental Material S3. Standing wave in neutral vocal tract configuration for the third resonance.
    Supplemental Material S4. Pressure distribution in neutral vocal tract configuration at 1000 Hz, off resonance.
    Supplemental Material S5. Animation of the temporal variation of the components of the source-filter representation during production of “Hello, how are you.” The animation also includes an audio track that is a slowed version of the phrase generated by the TubeTalker model.
    Supplemental Material S6. Audio file containing the real-time voice source signal (glottal flow wave) generated during the TubeTalker simulation of “Hello, how are you.”
    Supplemental Material S7. Audio file containing the real-time output pressure signal generated during the TubeTalker simulation of “Hello, how are you.”
    Supplemental Material S8. Animation of the temporal variation of the vocal tract in two representations during production of “Hello, how are you.” In the upper inset plot, the vocal tract is shown in tubular form, and in the main plot in the middle the vocal tract is shown in a pseudo-midsagittal form. The lower inset plot shows the simultaneous temporal variation of the frequency response function (resonances). The animation also includes an audio track that is a slowed version of the phrase generated by the TubeTalker model.
    Supplemental Material S9. Animation of the temporal variation of the frequency response function in three dimensions (time, frequency, amplitude) during production of “Hello, how are you.” There is a delay in the middle of the animation to allow the viewer to see the full history, and then the view rotates into a traditional spectrographic perspective. The animation also includes an audio track that is a slowed version of the phrase generated by the TubeTalker model.
    Supplemental Material S10. Animation of the temporal variation of narrow-band spectra in three dimensions (time, frequency, amplitude) during production of “Hello, how are you.” There is a delay in the middle of the animation to allow the viewer to see the full history, and then the view rotates into a traditional spectrographic perspective. The animation also includes an audio track that is a slowed version of the phrase generated by the TubeTalker model.
    Story, B. H. (2024). An approach to explaining formants. Perspectives of the ASHA Special Interest Groups. Advance online publication. https://doi.org/10.1044/2023_PERSP-23-00200
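    Since the tutorial's core point is that formants emerge when vocal tract resonances filter a voice source, a toy source-filter synthesis makes the mechanism concrete. The sketch below (Python assumed; the formant frequencies, bandwidths, and impulse-train source are simplistic stand-ins, not the TubeTalker model) excites a cascade of second-order resonators with a pulse train:

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000                       # sample rate, Hz
f0 = 120.0                       # fundamental frequency of the toy source
n = int(fs * 0.5)                # 0.5 s of signal

source = np.zeros(n)             # impulse train: a crude stand-in for glottal flow
source[::int(fs / f0)] = 1.0

def resonator(x, freq, bw, fs):
    """Second-order IIR resonance (one pole pair) with unity gain at DC."""
    r = np.exp(-np.pi * bw / fs)                          # pole radius from bandwidth
    a = [1.0, -2.0 * r * np.cos(2 * np.pi * freq / fs), r * r]
    return lfilter([sum(a)], a, x)

out = source
for F, B in [(700, 80), (1100, 90), (2600, 120)]:         # rough /a/-like values (assumed)
    out = resonator(out, F, B, fs)
```

    A spectrogram of `out` shows energy concentrated near the resonator frequencies: those spectral prominences in the radiated output are the formants, distinct from the resonances that shaped them.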

    A model of speech production based on the acoustic relativity of the vocal tract

    A model is described in which the effects of articulatory movements to produce speech are generated by specifying relative acoustic events along a time axis. These events consist of directional changes of the vocal tract resonance frequencies that, when associated with a temporal event function, are transformed, via acoustic sensitivity functions, into time-varying modulations of the vocal tract shape. Because the time courses of the events may overlap considerably, coarticulatory effects are generated automatically. Production of sentence-level speech with the model is demonstrated with audio samples and vocal tract animations. © 2019 Acoustical Society of America.
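    The mapping the abstract describes — temporal event functions scaling resonance-specific sensitivity functions to modulate the tract shape — can be illustrated with a toy perturbation computation. In the sketch below (Python assumed), the sensitivity functions are placeholder cosine shapes and the event timings, signs, and magnitudes are invented for illustration; in the actual model they derive from the acoustic energy distribution of each resonance along the tract:

```python
import numpy as np

n_sec, n_t = 44, 200                        # tube sections, time samples
x = np.linspace(0.0, 1.0, n_sec)            # normalized glottis-to-lips axis
t = np.linspace(0.0, 1.0, n_t)              # normalized time
area = np.full((n_t, n_sec), 3.0)           # neutral area function, cm^2 (assumed)

# placeholder "sensitivity functions" for the first two resonances; real ones
# come from each mode's kinetic/potential energy distribution along the tract
S = np.vstack([np.cos(np.pi * x), np.cos(2.0 * np.pi * x)])

def event(t, t0, width):
    """Smooth temporal event function ramping 0 -> 1 starting at t0 (assumed shape)."""
    u = np.clip((t - t0) / width, 0.0, 1.0)
    return 0.5 * (1.0 - np.cos(np.pi * u))

# two overlapping events: nudge the first resonance up, then the second down
e = np.vstack([0.8 * event(t, 0.2, 0.3), -0.5 * event(t, 0.4, 0.3)])

# each row of `area` is the tract shape at one instant; overlapped events
# superpose in the sum, which is what yields coarticulation-like blending
area += e.T @ S
```

    The key design idea this illustrates is that articulation is never specified directly: only the desired direction of resonance change and its timing are given, and the shape change falls out of the (here, placeholder) sensitivity functions.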